Method for Independent Estimation of the False Localization Rate for Phosphoproteomics

Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic datasets, and their statistical reliability on real datasets is largely unknown, potentially leading to studies reporting incorrectly localized phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring modifications on a decoy amino acid, that is, one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of amino acids, on both synthetic and real data sets, demonstrating that the selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys. We propose the use of a decoy amino acid to control false reporting in the literature and in public databases that re-distribute the data. Data are available via ProteomeXchange with identifier PXD028840.


Supplementary Figure 4:
Minimum distance between a phosphorylated S, T or Y residue and the nearest target amino acid (Ala, Leu, Gly, Asp, Glu or Pro), compared with the STY distribution, when searching PXD007058 (synthetic data set).

Supplementary Figure 7:
Comparison of FLR estimation when searching PXD008355 (Arabidopsis data set) using different decoy amino acids: pAla, pGly, pLeu, pAsp, pGlu and pPro (TPP, fully tryptic, 1% FDR). a) all PSMs; b) FLR ≤ 0.05; c) all PSMs with "no-choice" hits removed; d) FLR ≤ 0.05 with "no-choice" hits removed. Panels a and b are shown in the main body of the manuscript and are repeated here for comparison.

Supplementary Table 5:
Comparison of amino acid frequency ratios between STY and each decoy amino acid for the identified peptides, the identified phosphopeptides and the search database.

i) Investigating high-scoring false hits
When searching the PXD008355 Arabidopsis data set with TPP using different decoy amino acids, multiple high-scoring false localisations were observed. On further investigation, these incorrect hits were found to contain the same number of potential phosphosites as identified phosphosites. They can therefore be categorised as "no-choice" PSMs: there is no choice of localisation, so they are wrong because the search engine result is incorrect, not because the site localisation algorithm has failed. This may indicate that the search engine and PSM scoring are producing overconfident probability estimates. These "no-choice" hits were removed and the FLR estimates recalculated, which improved the FLR estimates for each method.
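The "no-choice" criterion described above can be illustrated with a minimal Python sketch. It assumes that the potential phosphosites in a peptide are the S, T and Y residues plus the chosen decoy residue, and that each PSM is a dictionary with hypothetical keys "peptide" and "n_phospho"; these names and the data layout are illustrative assumptions, not the pipeline used in this work.

```python
def count_candidate_sites(peptide: str, decoy: str = "A") -> int:
    """Count residues that the search is allowed to phosphorylate.

    Assumption: candidates are S, T, Y plus the single decoy residue.
    """
    candidates = set("STY") | {decoy}
    return sum(1 for residue in peptide if residue in candidates)


def is_no_choice(peptide: str, n_phospho: int, decoy: str = "A") -> bool:
    """True when every candidate site must carry a phosphate,
    so the localisation algorithm has nothing to decide."""
    return count_candidate_sites(peptide, decoy) == n_phospho


def remove_no_choice(psms: list, decoy: str = "A") -> list:
    """Drop "no-choice" PSMs before the FLR is recalculated.

    Each PSM is a dict with hypothetical keys "peptide" and "n_phospho".
    """
    return [
        psm for psm in psms
        if not is_no_choice(psm["peptide"], psm["n_phospho"], decoy)
    ]


# Illustrative input only, not data from the study.
example_psms = [
    {"peptide": "GAVLK", "n_phospho": 1},  # only candidate is A: no choice
    {"peptide": "ASGTK", "n_phospho": 1},  # A, S and T compete: kept
]
filtered = remove_no_choice(example_psms, decoy="A")
```

In this sketch a singly phosphorylated peptide whose only candidate residue is the decoy is necessarily a decoy hit, whatever the localisation score, which is why such PSMs reflect the peptide identification and not the localisation step.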

ii) Amino acid frequency analysis
To investigate the cause of the differences seen with glycine, the frequencies of the decoy amino acids were compared across the identified peptides, the identified phosphopeptides and the search databases. The frequencies of Ala and Gly relative to STY were similar in peptides and phosphopeptides across the data sets. Although the search database frequencies differed from the peptide and phosphopeptide frequencies, the database frequencies were fairly similar across all decoy amino acids. The amino acid frequency analysis was therefore unable to explain why Gly appears as an outlier.
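As an illustration of the kind of comparison described above, the following Python sketch computes the ratio of combined S, T and Y counts to the count of one decoy residue over a collection of sequences. The function name and the toy sequences are hypothetical, and the exact ratio definition used for the supplementary table is assumed here, not taken from the source.

```python
from collections import Counter


def sty_to_decoy_ratio(sequences, decoy: str) -> float:
    """Ratio of (S + T + Y) residue count to decoy residue count.

    Assumption: residues are pooled over all sequences before dividing.
    Returns NaN when the decoy residue never occurs.
    """
    counts = Counter()
    for sequence in sequences:
        counts.update(sequence)  # adds one count per residue letter
    sty_total = counts["S"] + counts["T"] + counts["Y"]
    if counts[decoy] == 0:
        return float("nan")
    return sty_total / counts[decoy]


# Toy sequences for illustration only, not data from the study.
toy_peptides = ["ASTK", "GAYK"]
ratio_ala = sty_to_decoy_ratio(toy_peptides, "A")
ratio_gly = sty_to_decoy_ratio(toy_peptides, "G")
```

Applying the same function separately to the identified peptides, the identified phosphopeptides and the search database would give one ratio per decoy and per sequence set, which can then be compared side by side.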